Statistical distance between texts and filtration methods in sequence comparison

نویسنده

  • Pavel A. Pevzner
چکیده

Upon searching local similarities in long sequences, the necessity of a 'rapid' similarity search becomes acute. Quadratic complexity of dynamic programming algorithms forces the employment of filtration methods that allow elimination of the sequences with a low similarity level. The paper is devoted to the theoretical substantiations of the filtration method based on the statistical distance between texts. The notion of the filtration efficiency is introduced and the efficiency of several filters is estimated. It is shown that the efficiency of the statistical l-tuple filtration upon DNA database search is associated with a potential extension of the original four-letter alphabet and grows exponentially with increasing l. The formula that allows one to estimate the filtration parameters is presented.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Efficient Filtration Method in Biological Sequence Databases

Sequence comparison is one of the most important primitive operations in bioinformatics. Roughly speaking, this operation finds which parts of sequences are alike and which parts are different. As the size of a sequence database scales to millions of base pairs, it becomes impractical to search the whole database with sequence alignment methods based on the dynamic programming approach which yi...

متن کامل

Study on Genetic Diversity of Terminal Fragment Sequence of Isolated Persian Tobacco Mosaic Virus

Tobacco mosaic virus (TMV) is one of the devastating plant viruses in the world that infects more than 200 plant species. Movement protein plays a supportive role in the movement of other plant viruses, and viral coat protein is highly expressed in infected plants and affects replication and movements of TMV. In order to investigate genetic variation in the terminal fragment sequence in Iranian...

متن کامل

An Evolutionary and Phylogenetic Study of the BMP15 Gene

DNA sequence data contains a wealth of biologically useful information. Recent innovations in DNA sequencing technology have greatly increased our capacity to determine massive amounts of nucleotide sequences. These sequences can be used to specify the characteristics of different regions, interpret the evolutionary relationships between categorized groups, likelihood of performing multiple com...

متن کامل

Vietnamese Text Accent Restoration with Statistical Machine Translation

Vietnamese accentless texts exist on parallel with official vietnamese documents and play an important role in instant message, mobile SMS and online searching. Understanding correctly these texts is not simple because of the lexical ambiguity caused by the diversity in adding diacritics to a given accentless sequence. There have been some methods for solving the vietnamese accentless texts pro...

متن کامل

Statistical Methods for the Analysis of Repeated Measurements

The best ebooks about Statistical Methods For The Analysis Of Repeated Measurements that you can get for free here by download this Statistical Methods For The Analysis Of Repeated Measurements and save to your desktop. This ebooks is under topic such as statistical methods for the analysis of repeated measurements statistical methods for the analysis of repeated measurements statistical method...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Computer applications in the biosciences : CABIOS

دوره 8 2  شماره 

صفحات  -

تاریخ انتشار 1992